Abstract

Ensemble methods for supervised machine learning have become popular due to their ability
to accurately predict class labels with groups of simple, lightweight "base learners." While
ensembles offer computationally efficient models that have good predictive capability, they tend
to be large and offer little insight into the patterns or structure in a dataset. We consider
an ensemble technique that returns a model of ranked rules. The model accurately predicts
class labels and has the advantage of indicating which parameter constraints are most useful
for predicting those labels. An example of the rule ensemble method successfully ranking rules
and selecting attributes is given with a dataset containing images of potential supernovas, where
the number of necessary features is reduced from 39 to 21. We also compare the rule ensemble
method on a set of multi-class problems with boosting and bagging, which are two well-known
ensemble techniques that use decision trees as base learners but do not have a rule ranking
scheme.



1 Introduction 

Machine learning algorithms are popular tools for classifying observations. These algorithms can
attain high classification accuracy for datasets from a wide variety of applications and with complex
behavior. In addition, through automated parameter tuning, it is possible to grow powerful models
that can successfully predict class affiliations of future observations. A disadvantage, however, is
that models can become overly complicated and, as a result, hard to interpret and expensive to
evaluate for large datasets. Ideally we would like to generate models that are quick to build, cheap
to evaluate, and that give users insight into the data, similar to how the size of coefficients in a
linear regression model can be used to understand attribute-response relationships and dependencies.



Ensemble methods are a class of machine learning algorithms that develop simple and fast
algorithms by combining many elementary models, called base learners, into a larger model. The larger
model captures more behavior than each base learner captures by itself and so collectively the base
learners can model the population and accurately predict class labels [H]. Classical decision tree
ensemble methods, such as bagging and boosting, are well known and have been tested and refined
on many datasets [1, 8, 22]. In one such study, Banfield et al. [1] studied the accuracy of boosting
and bagging on a variety of public datasets and found that in general neither bagging nor boosting
was a statistically significantly stronger method.

In this paper, we modify, extend, and test an implementation [21] of the rule ensemble method
proposed by Friedman and Popescu [16] for binary classification with bagging and with boosting. 
The Friedman and Popescu rule ensemble method is attractive, as it combines the rule weighting 
or variable importance that regression provides with the quick decision tree methods and collective 
decision making of many simple base learners. The method builds rules that take the form of
products of indicator functions defined on hypercubes in parameter space. The rules are fit by 
growing decision trees, as each inner node of a tree takes the desired form of a rule. The method 
then performs a penalized regression to combine the rules into a sparse model. The entire method 
resembles a linear regression model, but with different terms. Many ensemble methods provide 
little insight into what variables are important to the behavior of the system, but by combining the 
rules with regression, the rule ensemble method prunes rules of little utility and ranks remaining 
rules in order of importance. 

We also modified the rule ensemble method to use various coefficient solving methods on a set of
binary and multi-class problems. Previous implementations of this algorithm are either currently
unavailable [11] or have not been fully tested on a wide set of problems [7]. We extended the
rule ensemble method to multiple class classification problems with one-versus-all classification
[24] and tested it on classical machine learning datasets from the UC Irvine machine learning
repository [3]. These datasets were chosen because they have been used to test previous tree
ensembles [1, 8, 18, 22, 29] and countless other machine learning algorithms. Finally, we look at
different methods that can be used to solve for the coefficients and show how one can use the
rule ensemble method to reduce the dimension of a problem. We give an example of identifying
important attributes in a large scientific dataset by applying our techniques to a set of images of
potential supernovas [4].

1.1 Overview of Rule Ensemble Method 

Suppose we are given a set of data points $\{x_i, y_i\}_{i=1}^N$, where $x_i$ denotes the $i$th observation, with
label $y_i$. Each of the observations, $x \in \mathbb{R}^n$, has $n$ attributes or feature values that we measure for
each observation. The matrix $X$ will denote the entire set of all $x_i$'s. The $j$th feature of the $i$th
observation is the scalar $x_{ij}$. Our goal then is to be able to predict what class $y$ a future unlabeled
observation $x$ belongs to. The method below focuses specifically on the binary decision problem
where $y$ can be one of only two classes, $\{-1, +1\}$. To classify observations we seek to construct a
function $F(x)$ that maps an observation $x$ to an output variable $\hat{y} = F(x)$ that predicts the true
label $y$.

Define the risk of using any function that maps observations to labels as

$$R(F) = E_{x,y}\, L(y, F(x)) \qquad (1)$$

where $E_{x,y}$ is the expectation operator. $L(y, \hat{y})$ is a chosen loss function that defines the cost of
predicting a label $\hat{y}$ for an observation when the true label is $y$. While various loss functions have
been developed, in practice we will use the ramp loss function as it is particularly well suited to
the binary classification problem we consider [14, 16]. Within this framework we seek to find a
function, $F^*(x)$, that minimizes the risk over all such functions

$$F^*(x) = \arg\min_{F} E_{x,y}\, L(y, F(x)).$$



The optimal $F^*(x)$ is defined on the entire population. However, we only have a training sample of
observed data $S = \{x_i, y_i\}_{i=1}^N$, so we will construct an approximation $F(x)$ to $F^*(x)$ that minimizes
the expected loss on this training set. We assume that the model $F(x)$ has the form of a linear
combination of $K$ base learners $\{f_k(x)\}_{k=1}^K$,

$$F(x; a) = a_0 + \sum_{k=1}^{K} a_k f_k(x). \qquad (2)$$



The next step is to find coefficients $a = \{a_1, a_2, \ldots, a_K\}$ that minimize the risk (1). Like $F^*(x)$,
the risk is defined over the entire population, so we will use the approximation $a^*$ that minimizes
the risk over the given sample set of observations $S$. In particular, we take $a^*$ to be the solution
of

$$a^* = \arg\min_{a} E_S\, L(y, F(x; a)) \qquad (3)$$

$$\phantom{a^*} = \arg\min_{a} \frac{1}{N} \sum_{i=1}^{N} L(y_i, F(x_i; a)) \qquad (4)$$

$$\phantom{a^*} = \arg\min_{a} \frac{1}{N} \sum_{i=1}^{N} L\Big(y_i,\; a_0 + \sum_{k=1}^{K} a_k f_k(x_i)\Big). \qquad (5)$$



If the loss function, L, is taken to be the mean squared error then this is simply a linear regression 
problem. 
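As a minimal sketch of equations (2) and (4), the following Python fragment evaluates the empirical risk of a linear combination of base learners under squared-error loss. The toy sample, the single threshold learner, and all function names are illustrative assumptions and are not taken from the implementation studied in this paper.

```python
def ensemble_predict(x, a0, coeffs, learners):
    """F(x; a) = a0 + sum_k a_k * f_k(x), as in equation (2)."""
    return a0 + sum(a * f(x) for a, f in zip(coeffs, learners))


def empirical_risk(data, a0, coeffs, learners, loss):
    """Average loss over the sample, as in equation (4)."""
    total = sum(loss(y, ensemble_predict(x, a0, coeffs, learners)) for x, y in data)
    return total / len(data)


def squared_error(y, y_hat):
    return (y - y_hat) ** 2


# Hypothetical toy sample with one base learner (an indicator of x > 0.5).
learners = [lambda x: 1.0 if x > 0.5 else 0.0]
data = [(0.1, -1.0), (0.3, -1.0), (0.7, 1.0), (0.9, 1.0)]

risk_fitted = empirical_risk(data, -1.0, [2.0], learners, squared_error)
risk_null = empirical_risk(data, 0.0, [0.0], learners, squared_error)
print(risk_fitted, risk_null)
```

With squared-error loss, minimizing this quantity over the coefficients is an ordinary least-squares fit on the base learner outputs.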



In many cases, a solution to equation (5) is not the best for constructing a sparse interpretable
model or a predictive model that is not overfit to the training data. Instead, one would like to have
a solution that has as few components as possible. To achieve a sparse solution, a penalty term
can be included that prevents less influential terms from entering the model. Here, we use the
lasso penalty [26] and the approximation $\hat{a}$, which is the solution to the penalized problem

$$\hat{a} = \arg\min_{a} \sum_{i=1}^{N} L\Big(y_i,\; a_0 + \sum_{k=1}^{K} a_k f_k(x_i)\Big) + \lambda \sum_{k=1}^{K} |a_k|. \qquad (6)$$

The impact of the penalty is controlled by the parameter $\lambda > 0$. This penalized problem has
received a great deal of attention [Qj [131 [Hj and enables both estimation of the coefficients as well
as coefficient selection.
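To illustrate how the penalty in (6) removes uninformative terms, here is a small sketch that solves the penalized problem by cyclic coordinate descent with soft-thresholding. It assumes squared-error loss rather than the ramp loss used in this paper, and the solver, its name, and the toy data are assumptions made for illustration only.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0.0 inside [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(F, y, lam, sweeps=200):
    """Minimize sum_i (y_i - a0 - sum_k a_k F[i][k])^2 + lam * sum_k |a_k|."""
    n, K = len(F), len(F[0])
    a0, a = sum(y) / n, [0.0] * K
    for _ in range(sweeps):
        for k in range(K):
            # Residual with the k-th term left out of the model.
            r = [y[i] - a0 - sum(a[j] * F[i][j] for j in range(K) if j != k)
                 for i in range(n)]
            num = sum(F[i][k] * r[i] for i in range(n))
            den = sum(F[i][k] ** 2 for i in range(n))
            a[k] = soft_threshold(num, lam / 2.0) / den if den > 0 else 0.0
        # The intercept is not penalized, so it is the mean residual.
        a0 = sum(y[i] - sum(a[j] * F[i][j] for j in range(K)) for i in range(n)) / n
    return a0, a


# Column 0 separates the labels; column 1 carries no information.
F = [[0.0, 1.0], [0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
y = [-1.0, -1.0, 1.0, 1.0]
a0, a = lasso_coordinate_descent(F, y, lam=1.0)
print(a0, a)
```

In this toy problem the coefficient of the uninformative column is driven to exactly zero, while a sufficiently large value of `lam` zeroes every coefficient.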
This section provided a brief introduction to the methods used in this study, which were developed
by Friedman and Popescu [14, 15, 16]. Other papers provide more details and justification
of the rule ensemble method [14, 16] as well as the method that is used to assemble the rules in
the latter part of the algorithm [15]. Additional sources also provide more details for the other
algorithms that we employed to compute the coefficients [TSj [T71 [28].

In Section 2 we will discuss how to build base learners $f_k$. Section 3 will provide more details on
the regression method used to solve equation (6). Sections 5 and 6 will present computational results
comparing the rule ensemble method with other ensemble methods.

2 Base Learners 

The base learners $f_k$ in equation (2) can be of many different forms. Decision trees, which have
been used alone as classification models, have been used as base learners in ensemble methods such 
as bagging, random forests, and boosting. Decision trees are a natural choice to use for a learner, 
as many small trees (meaning each tree has few leaves) can be built quickly and then combined into 
a larger model. The bagging method grows many trees, then combines them with equal weights 
[5]. Boosting is more sophisticated as it tries to build the rules in an intelligent manner, but it still 
gives each tree an equal weight in the ensemble [10] . 

2.1 Using Rules as Base Learners 

In the rule ensemble method, simple rules denoted by $r_k$ are used as the base learners and take the
form

$$r_k(x_i) = \prod_{j} I(x_{ij} \in p_{kj}), \qquad (7)$$

where $I(x_{ij} \in p_{kj})$ is an indicator function. The indicator function evaluates to 1 if the observed
attribute value $x_{ij}$ is in the parameter space defined by $p_{kj}$, and 0 if the observation is not in that
space. Each $p_{kj}$ is a constraint that the $k$th rule assigns to the $j$th attribute. For convenience
we will denote $p_k = (p_{k1}, p_{k2}, \ldots)$ to be the vector of parameter constraints that an observation must
meet for the $k$th rule to evaluate to 1. Note that a given rule can have multiple constraints
on a single attribute, as well as a different number of constraints (indicator functions) than other
rules. To emphasize that each rule is defined by a set of parameters we can write $r_k(x_i) = r_k(x_i; p_k)$.

To fit a model we need to generate rules by computing parameter sets $\{p_k\}_{k=1}^K$. In this study, we
will use decision trees to generate rules, where each internal and terminal node (not the root node)
of a decision tree takes the form of a simple rule defined by (7). Having $r_k(x_i; p_k) = 1$ means
that the $k$th rule is obeyed by the $i$th observation and that it was sorted into the $k$th node of the
decision tree that generated the rule.
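A rule of the form (7) can be sketched in a few lines of Python. The representation of each constraint as an `(attribute index, lower bound, upper bound)` interval is an assumption chosen for illustration; the paper only requires that each constraint define a region of parameter space.

```python
import math


def evaluate_rule(x, constraints):
    """Product of indicator functions: 1 if x meets every constraint, else 0.

    Each constraint is a hypothetical (j, low, high) triple meaning
    low <= x[j] < high.
    """
    value = 1
    for j, low, high in constraints:
        value *= 1 if low <= x[j] < high else 0
    return value


# A rule with two constraints on attribute 0 and one on attribute 2.
rule = [(0, 0.5, math.inf), (0, -math.inf, 0.9), (2, -math.inf, 3.0)]

print(evaluate_rule([0.7, 9.0, 1.0], rule))  # all three constraints hold
print(evaluate_rule([0.2, 9.0, 1.0], rule))  # first constraint fails
```

Following the splits from the root of a decision tree down to a node yields exactly such a list of constraints, which is why every non-root node can be read as a rule.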

2.2 Tree Construction - Rule Generation 

Decision trees are built using the CART (Classification and Regression Trees) algorithm [6], which
is summarized in Figure 1 and outlined below. We let

$$T_m = \{r_j^m\}_{j=1}^{2(t_m - 1)}$$

denote the set of rules contained in the $m$th tree, which has $t_m$ terminal nodes. Let $T_m(x_i)$
denote the prediction that the $m$th tree makes for observation $x_i$; it is the evaluation of the rules
in $T_m$ on $x_i$.

Each tree is built on a random subset of observations $S_m(\eta) \subset \{x_i, y_i\}_{i=1}^N$, as training on the entire
dataset can be expensive as well as overfit the tree. $\eta$ is a parameter that controls the diversity of
the rules by defining the number of observations chosen to be in the subset $S_m(\eta)$. As subset size
$\eta$ decreases, diversity increases with potentially less global behavior getting extracted. Diversity
between the trees can also be increased by varying the final size of each tree. Clearly larger trees
include more precise rules defining terminal nodes and thus are inclined to overtrain, but confining
the size of a tree too strictly can prevent it from capturing more subtle behavior within the dataset.
To avoid under- or overfitting, we grow each tree until it has $t_m$ terminal nodes, where $t_m$ is drawn
from an exponential distribution. The distribution has mean $L$, which does have to be selected a
priori. The size of a tree is determined by growing each branch until no further nodes can be split
because one of the following termination conditions has been met:

1. The number of observations in a terminal node is less than some selected cutoff,

2. The impurity of a node is less than a selected cutoff,

3. The total number of nodes in the tree is greater than $t_m$.

The splitting attribute and value are chosen as the split that minimizes the sum of the impurities
(variance of the node) of the two child nodes if that split were taken. For each split only a random 
sample of attributes are considered in order to both increase the diversity of learners and decrease 
training time for huge datasets. 
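The split selection and the random tree size described above can be sketched as follows. This is an illustration under stated assumptions, not the CART implementation used in the experiments: candidate thresholds are taken as midpoints between consecutive observed values, and the rounding and minimum of two leaves in `draw_tree_size` are choices made here for the example.

```python
import random


def variance_impurity(values):
    """Variance of the targets in a node (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(X, y, candidate_attributes):
    """Return (score, attribute, threshold) minimizing the summed child impurity."""
    best = None
    for j in candidate_attributes:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] < t]
            right = [yi for row, yi in zip(X, y) if row[j] >= t]
            score = variance_impurity(left) + variance_impurity(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


def draw_tree_size(mean_leaves, rng):
    """Draw a number of terminal nodes from an exponential distribution."""
    return max(2, int(round(rng.expovariate(1.0 / mean_leaves))))


X = [[1.0, 5.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]]
y = [-1.0, -1.0, 1.0, 1.0]
rng = random.Random(0)
attributes = rng.sample(range(2), 2)  # a random sample of attributes per split
print(best_split(X, y, attributes), draw_tree_size(5.0, rng))
```

Sampling fewer attributes than are available at each split is what adds diversity between trees and reduces the cost of each split search.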

2.3 Gradient Boosting 

To avoid simply regrowing overlapping rules, with no further predictive capability, we use gradient
boosting to intelligently generate diverse rules. With gradient boosting, each tree is trained on the
pseudo residuals $\rho_m$ of the risk function evaluated on the test set rather than training directly on
the data [18]. The $i$th element of the pseudo residual vector in the $m$th iteration is given by

$$\rho_{mi} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{m-1}} \qquad (8)$$

for all $x_i \in S_m(\eta)$. Each $\rho_m$ is a vector with as many entries as there are observations in the
subsample $S_m(\eta)$ on which it is evaluated. $F_m(x)$ is the memory function at the $m$th iteration. It
gives a label prediction based on all the previous learners (trees) that were built. Note that $F_m(x)$
is an intermediate model of trees that is used in rule generation, while $F(x)$ is the final prediction
model that has rules as linear terms. Training on the pseudo residuals allows one to account for
what the previous trees were unable to capture. This method is similar to the method of regressing
on residuals in multidimensional linear regression. Using pseudo residuals also provides another
termination condition. If the pseudo residuals shrink below a chosen value, enough behavior has
been captured and no further rules are generated. A shrinkage parameter, $0 \le \nu \le 1$, controls the
dependency of the prediction on the previously built learners. Using $\nu = 0$ results in no dependence
on past calculations, so that the next rule is built directly on the data labels and has had no part
of the labeled value "accounted for" by dependence on previous calculations.

Rule Generation Algorithm

Input: data $\{x_i, y_i\}_{i=1}^N$ where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$

$F_0(x_i) = \arg\min_{c \in \mathbb{R}} \sum_{i=1}^{N} L(y_i, c)$

For $m = 1, \ldots, M$
    select random subset $S_m(\eta) \subset \{x_i\}_{i=1}^N$
    select number of terminal nodes $t_m \sim \exp(L)$
    calculate pseudo residuals $\rho_m$ with (8)
    build tree $T_m$ on $\{x_i, \rho_{mi}\}$ for all $i \in S_m(\eta)$
    update $F_m(x_i) = F_{m-1}(x_i) + \nu\, T_m(x_i)$
End if $\|\rho_m\|_\infty$ small enough
    or total rules > max
Return: rules = $\{\text{internal nodes of } T_m\}_{m=1}^M$

Figure 1: Outline of how to generate rules
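The role of the pseudo residuals and the shrinkage parameter can be illustrated with a short sketch. It assumes the loss $L(y, F) = (y - F)^2 / 2$, for which the pseudo residual (8) is simply $y_i - F_{m-1}(x_i)$, and it uses one-split stumps on a single attribute in place of full CART trees. Both simplifications, and all names below, are assumptions for the example.

```python
def pseudo_residuals(y, F_prev):
    """Negative gradient of (y - F)^2 / 2 with respect to F, at F = F_prev."""
    return [yi - fi for yi, fi in zip(y, F_prev)]


def fit_stump(x, targets):
    """Best single split of a 1-D attribute; predicts the mean target per side."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, targets) if xi < t]
        right = [r for xi, r in zip(x, targets) if xi >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi < t else mr


def boost(x, y, n_trees, nu):
    """Return the memory function values and the largest residual per iteration."""
    F = [sum(y) / len(y)] * len(y)  # F_0: constant minimizing squared error
    history = []
    for _ in range(n_trees):
        rho = pseudo_residuals(y, F)
        history.append(max(abs(r) for r in rho))
        tree = fit_stump(x, rho)
        F = [fi + nu * tree(xi) for fi, xi in zip(F, x)]
    return F, history


x = [1.0, 3.0, 7.0, 9.0]
y = [-1.0, -1.0, 1.0, 1.0]
_, shrinking = boost(x, y, n_trees=5, nu=0.5)
_, constant = boost(x, y, n_trees=5, nu=0.0)
print(shrinking, constant)
```

With `nu=0.5` the largest pseudo residual halves at every iteration, so later trees see only what earlier trees missed. With `nu=0.0` the memory function never changes and every tree is fit to the same targets.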



3 Weighting Rules 



To combine the rules into a linear model, we need to approximate the coefficients $\hat{a}$ defined in
equation (6). Here we implement a method that approximates $\hat{a}$ with an accelerated gradient descent
method developed by Friedman and Popescu [15] and summarized in Figure 2. We will refer to
this method as Pathbuild, as it does not solve (6) explicitly, but rather constructs $\hat{a}$ by starting
with a null solution and then incrementing along a constrained gradient descent path, distinguished
by a parameter $\tau$. Alternative algorithms for approximating $\hat{a}$ will be discussed and compared later.

We would like to find a value for the lasso penalty that yields the sparsest solution to (6) while
maintaining a model with high accuracy. We initialize the coefficients to 0 and set the constant
intercept $a_0$ to the value that minimizes the loss,

$$a_0 = \arg\min_{a} \sum_{i=1}^{N} L(y_i, a).$$

This may be better understood by considering that $a_0$ will be the mean of $(y_1, \ldots, y_N)$ when the
loss is mean squared error. We approximate $\hat{a}$ iteratively and calculate the $(\ell+1)$st iterate, $a^{\ell+1}$,
by taking

$$g_k(X; a^\ell) = -\frac{\partial}{\partial a_k} \frac{1}{N} \sum_{i=1}^{N} L(y_i, F(x_i; a^\ell)),$$

$$k^* = \{k : |g_k(X; a^\ell)| = \|g(X; a^\ell)\|_\infty\}.$$

We update the coefficients by

$$a_k^{\ell+1} = \begin{cases} a_k^\ell + \delta\, g_k(X; a^\ell) & \text{if } k \in k^*, \\ a_k^\ell & \text{otherwise,} \end{cases} \qquad (9)$$

Pathbuild: Gradient Regularized Descent Algorithm

set constant $a_0 = \arg\min_{a} \sum_{i=1}^{N} L(y_i, a)$

$a_k^0 = 0$, $k = 1, \ldots, K$

For $\ell = 1, \ldots,$ max iterations
    $g_k(X; a^\ell) = -\frac{\partial}{\partial a_k} \frac{1}{N} \sum_{i=1}^{N} L(y_i, F(x_i; a^\ell))$
    $a_k^{\ell+1} = a_k^\ell + \delta\, g_k(X; a^\ell)$ if $k \in k^*$, and $a_k^{\ell+1} = a_k^\ell$ otherwise
Stop if risk increased
    or $\ell$ > max iterations
    or gradient < tolerance

Figure 2: Outline of Pathbuild method



where $g_k(X; a^\ell)$ is the $k$th component of the gradient for $F(x)$, calculated with the $\ell$th iterate and
evaluated on the entire dataset. The scaling parameter $\delta > 0$ can be set constant or chosen at each
step in a clever manner. Note that in equation (9) only a single component of the coefficient vector is
updated at any iteration and thus only a single rule is able to enter the model at an iteration. The
method only progresses in the direction of rules which have a large effect on the predictive capability and
avoids steps that are of trivial effect. This condition may be relaxed by incrementing all of the
components of $a$ that have a sufficiently large gradient,

$$k^* = \{k : |g_k(X; a^\ell)| \ge \tau\, \|g(X; a^\ell)\|_\infty\}.$$

The parameter $\tau \in [0, 1]$ controls how large a component of the gradient must be relative to the
largest component in order for a coefficient to be updated. Computing the gradient is expensive,
but reorganizations and intelligent approximations to accelerate the computation are presented for
three different loss functions in the appendix [15]. The tricks used for this "fast" method are most
effective for ramp loss and make Pathbuild a particularly attractive method.

In Sections 6.2-6.4 we will compare Pathbuild with three different algorithms that can be used
to solve for the coefficients. Each algorithm uses a slightly different formulation of the problem
defined in equation (6) and a different technique to encourage a sparse solution that also has little
risk. The three algorithms also use mean squared error to define loss rather than the ramp loss
function that we use in Pathbuild.
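The thresholded update can be sketched in Python as follows. This is a simplified illustration and not the accelerated Pathbuild implementation: it assumes the loss $(y - F)^2 / 2$ instead of ramp loss, a constant step size, and a fixed intercept, and the function name and toy data are hypothetical.

```python
def thresholded_descent(F, y, tau, delta=0.1, max_iter=1000, tol=1e-8):
    """Gradient descent that only updates coefficients with a large gradient."""
    n, K = len(F), len(F[0])
    a0 = sum(y) / n  # constant minimizing squared error
    a = [0.0] * K

    def residuals(coeffs):
        return [y[i] - a0 - sum(coeffs[k] * F[i][k] for k in range(K))
                for i in range(n)]

    def risk(coeffs):
        return sum(r ** 2 for r in residuals(coeffs)) / (2 * n)

    previous = risk(a)
    for _ in range(max_iter):
        r = residuals(a)
        # Negative gradient of the risk with respect to each coefficient.
        g = [sum(F[i][k] * r[i] for i in range(n)) / n for k in range(K)]
        g_max = max(abs(v) for v in g)
        if g_max < tol:
            break
        trial = [a[k] + delta * g[k] if abs(g[k]) >= tau * g_max else a[k]
                 for k in range(K)]
        current = risk(trial)
        if current > previous:  # stop if the risk increased
            break
        a, previous = trial, current
    return a0, a


# Column 0 separates the labels; column 1 carries no information.
F = [[0.0, 1.0], [0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
y = [-1.0, -1.0, 1.0, 1.0]
_, sparse = thresholded_descent(F, y, tau=1.0, max_iter=10)
_, dense = thresholded_descent(F, y, tau=0.0, max_iter=10)
print(sparse, dense)
```

Stopped early, the run with `tau=1.0` has moved only the informative coefficient off zero, while `tau=0.0` behaves like plain gradient descent and lets the uninformative coefficient drift away from zero.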

4 Datasets and Methods for Experiments

To test the behavior of the rule ensemble method on a binary classification problem, we used a
dataset of images taken by a telescope [H] [20] [23], with the goal being to identify potential supernovas.
The initial data had three images for each observation. Those images were processed to yield 39
statistics for each observation that described the distribution and color of individual pixels within
the original three images. These statistics became the attributes for the dataset and the observations
Set  Name       Attributes  Observations  Classes
1    breast-w   9           699           2
2    glass      9           214           7
3    ion        34          351           2
4    iris       4           150           3
5    pendigits  16          10992         10
6    phoneme    5           5404          2
7    pima       8           768           2
8    sonar      60          208           2
9    vehicle    18          846           4
10   waveform   21          5000          3

Table 1: Description of UC Irvine datasets used to compare ensemble methods on multi-class
problems.

were labeled with +1 or -1 if they were or were not, respectively, an image of a supernova-like object.
The dataset contains a total of 5,000 positive and 19,988 negative observations. 

To test how the rule ensemble works on the binary classification problem, we use a procedure that 
first randomly selects 2,500 positive observations and 2,500 negative observations for a training 
set, and then uses the remaining data for the testing set. This selection process is repeated 10 
times for cross-validation. False positive and false negative error rates were used to assess the 
accuracy of the methods in addition to the overall error rate. The false positive rate is the ratio of 
observations misclassified as positive to the total number of negative observations in the test set, 
while the false negative rate is the ratio of observations misclassified as negative to the number of 
positive observations in the test set. The overall error rate is the ratio of observations misclassified 
to the total number of observations in the test set. The experiments show the effect of the rule 
complexity (tree depth), number of rules available (tree size), and $\tau$ thresholding in Pathbuild on
the accuracy of the method. We also consider the effect of substituting different coefficient solvers 
in place of Pathbuild. 
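The three error rates defined above can be computed as in the following sketch, which assumes labels coded as +1 and -1 as in the supernova dataset. The function name and the toy label vectors are illustrative.

```python
def error_rates(y_true, y_pred):
    """False positive, false negative, and overall error rates for +1/-1 labels."""
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    n_neg = sum(1 for t in y_true if t == -1)
    n_pos = sum(1 for t in y_true if t == 1)
    return {
        "false_positive": false_pos / n_neg,  # relative to negatives in the test set
        "false_negative": false_neg / n_pos,  # relative to positives in the test set
        "overall": (false_pos + false_neg) / len(y_true),
    }


y_true = [1, 1, -1, -1, -1, -1]
y_pred = [1, -1, 1, -1, -1, -1]
rates = error_rates(y_true, y_pred)
print(rates)
```

Because the test set here is dominated by negative observations, the overall error rate can look small even when the false negative rate is high, which is why all three rates are reported.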

To assess the overall utility of the rule ensemble we extend our numerical experiments to multi-class 
problems, which are described in Section 5. We compare the rule ensemble with classical bagging
and boosting methods by testing all three algorithms on 10 datasets from the UC Irvine Machine 
Learning Data Repository [3] with five 2-fold cross-validation tests. A 2-fold cross-validation test 
is similar to the method described above except that the dataset is split into equally sized subsets 
with the proportion of observations in each class the same in both subsets. Then one set is used for 
training and the other for testing, and then the sets are switched and retrained and retested. The 
datasets are briefly described in Table 1. The UC Irvine sets are chosen since they have been used
in many machine learning studies [22] and are used by Banfield et al. [1] to compare bagging with
boosting. The UC Irvine sets are taken from a wide variety of applications, so they also present a 
good breadth of data to test the versatility of methods. 
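A five-times-repeated 2-fold cross-validation with class proportions preserved in both halves can be sketched as below. The `evaluate` callback, which is assumed to train on one set of indices and return an error on the other, is a hypothetical placeholder for whichever ensemble is being tested.

```python
import random


def stratified_two_fold(labels, rng):
    """Split indices into two halves with the same class proportions."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    fold_a, fold_b = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        half = len(indices) // 2
        fold_a.extend(indices[:half])
        fold_b.extend(indices[half:])
    return fold_a, fold_b


def five_by_two_cv(labels, evaluate, seed=0):
    """Run five 2-fold tests; each test trains and tests in both directions."""
    rng = random.Random(seed)
    errors = []
    for _ in range(5):
        fold_a, fold_b = stratified_two_fold(labels, rng)
        errors.append(evaluate(train=fold_a, test=fold_b))
        errors.append(evaluate(train=fold_b, test=fold_a))
    return errors


labels = [0] * 6 + [1] * 4
fold_a, fold_b = stratified_two_fold(labels, random.Random(1))
# Placeholder evaluation: report the fraction of data held out for testing.
errors = five_by_two_cv(labels, lambda train, test: len(test) / len(labels))
print(fold_a, fold_b, errors)
```

Each of the five repetitions yields two error estimates, giving ten estimates per dataset and method.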

Experiments using the rule ensemble method were run using Matlab™7.10 on a MacBook Pro 
with a 2.66 GHz Intel Core i7 processor. 






5 Multiple Class Classification Results 



The rule ensemble method is designed for binary classification problems, but many datasets contain
multiple classes that one needs to identify. To be applicable to classification in general, we need to
extend the rule ensemble to multi-class problems. Decision trees easily extend to multiple classes,
but the regression performed to assemble the rules in the rule ensemble prevents the rule ensemble
from being extended directly to classification problems where the classes are not ordered. To identify
multiple classes with the rule ensemble method we use the one-versus-all (OVA) classification technique
that has been used for successfully extending many binary classification algorithms into multi-class
algorithms [19, 25]. Other methods for extending binary classification algorithms to multiple class
problems exist, such as all-versus-all classification. However, these methods require a large number
of models to be built and are thus more expensive than OVA and frequently provide no more utility
than the OVA classification method [24].

For a problem with $J$ classes, OVA classification performs $J$ binary tests, where the $j$th test checks
if an observation is a member of the $j$th class or not. Each observation gets a vector label
prediction $\hat{y} \in \mathbb{R}^J$, where each entry $\hat{y}_j$ is from the binary test classifying the $j$th class versus any
other class. The prediction $\hat{y}$ is a vector of -1's with a single positive entry. The index of the
positive entry is the class that the observation is predicted to be from.

To extend the rule ensemble method we perform $J$ binary tests and each test returns a real valued
prediction $F_j$. In the binary problem the label $\hat{y}_j$ is predicted to be the sign of the real value
returned. However, in this setting it is possible that one of the binary models will misclassify the
observation and result in $F_j$ being positive for more than one value of $j$. If we just took the sign
of each $F_j$ then we would have a vector $\hat{y}$ with multiple positive entries, indicating the observation
was in multiple classes. In the event that $F_j$ is positive for more than one value of $j$, we take
the prediction to be the class that has the most definitive prediction, i.e. the class $j^*$ where $F_{j^*}$
is greater than any other class label prediction. Choosing the largest label prediction is sensible,
since the more confident the algorithm is that the observation is in a certain class, the closer to 1
the label prediction will be. The closer to 0 a class prediction is, the less certain the algorithm is
of the observation's class affinity.
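The tie-breaking step can be sketched as a simple argmax over the $J$ real-valued outputs. Applying the same argmax when no output is positive is an assumption made here for completeness; the text only discusses the case of several positive outputs.

```python
def ova_predict(scores):
    """Pick the class with the largest binary-model output F_j.

    Returns the winning class index and the label vector of -1's with a
    single +1 at that index.
    """
    best = max(range(len(scores)), key=lambda j: scores[j])
    label_vector = [1 if j == best else -1 for j in range(len(scores))]
    return best, label_vector


# Two of the three binary models claim the observation; the larger output wins.
print(ova_predict([0.2, 0.9, -0.4]))
```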

Here we compare the rule ensemble method, using Pathbuild, with results from bagging and boosting
tree ensemble methods. To compare we employ 10 datasets from the UC Irvine data repository
[3] and the testing method parameters previously used to compare various ensemble methods [1].
Bagging uses 1000 trees, boosting uses 50, and both employ random forests for growing trees in
five 2-fold cross-validations. Tree ensemble labels can be estimated by a voting procedure, in which
the prediction is the class that most of the trees predict the observation to be part of, or by an
averaging procedure, in which the label is the average of the predictions made by all the trees.
Results for both methods are presented. Minimal tuning was used to run the rule ensemble method
on different datasets.
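The difference between the two combination procedures can be seen in a small binary example. The sketch below assumes each tree returns a real-valued output whose sign is its vote; the function names and the tree outputs are hypothetical.

```python
from collections import Counter


def vote(tree_labels):
    """Predict the label chosen by the largest number of trees."""
    return Counter(tree_labels).most_common(1)[0][0]


def average(tree_outputs):
    """Predict the sign of the mean of the real-valued tree outputs."""
    mean = sum(tree_outputs) / len(tree_outputs)
    return 1 if mean >= 0 else -1


# Two trees lean weakly positive, one is strongly negative.
outputs = [0.1, 0.2, -0.9]
labels = [1 if value >= 0 else -1 for value in outputs]
print(vote(labels), average(outputs))
```

Voting follows the two weakly positive trees, while averaging is dominated by the single confident negative tree, so the two procedures can disagree on the same ensemble.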
Figure 3: Binary tests on each class of vehicle data. Figure shows accuracy when using bagging in
an OVA classification method rather than with multi-class decision trees. 



5.1 Results Using OVA Classification on Vehicle Dataset 

Figure 3 compares using the rule ensemble and bagging on the vehicle dataset. Bagging here is used
in an OVA classification scheme rather than in its standard, direct multiple classification method.
The error at predicting any given label in the set is shown. As can be seen in Figure 3, the rule
ensemble beats bagging for the majority of the classes. Figure 3 also shows the varying level of
success that the ensemble techniques had at predicting each class. Some classes are easier to identify
than others (e.g. "opel" is easier to distinguish than "van"). Different ensembles were better suited
to one class versus another, and which ensemble was better for a class was not consistent for all
classes in a dataset.



5.2 Results Using OVA Classification on All Datasets 

The results of the multiple class tests are given in Table 2. The rule ensemble is much stronger than
the tree ensembles if averaging of each tree's label prediction is used for classification. However, if
the trees vote on which class label is best, then the rule ensemble is better on some datasets but
not others. Voting clearly was better at label prediction than averaging base learner predictions,
but neither boosting nor bagging provided a universal win over the rule ensemble, as can be seen
in Figure 4. What is not apparent in Table 2 is that the rule ensemble was a much better predictor
for binary labels than the tree ensembles. This result is apparent in Figure 3, where nearly every
individual class is better predicted by the rule ensemble method. Figure 5 shows the accuracy of
the rule ensemble method with different coefficient solvers. Some datasets are easier to classify
(larger percent of data correctly classified) while others, such as the #2 dataset glass, were more
difficult to classify for all the methods.
                           Rule Ensemble               Bagging            Boosting
Name            # Classes  Pathbuild  Fpc    SPGL1     Voting  Average    Voting  Average
breast-w        2          4.34       4.60   4.75      3.51    5.97       3.34    10.93
glass           7          37.99      33.47  35.26     26.92   39.16      29.83   54.72
ion             2          9.97       10.43  9.23      7.01    13.44      7.01    24.49
iris            3          4.80       4.27   5.33      4.93    5.51       5.73    5.95
pendigits       10         6.94       5.65   6.10      1.23    7.05       0.87    25.68
phoneme         2          14.97      14.33  14.16     12.06   16.97      10.81   26.61
pima            2          24.45      25.76  24.56     23.65   30.13      26.22   38.78
sonar           2          22.76      21.14  20.67     23.82   33.62      23.74   39.70
vehicle         4          28.35      26.69  27.63     26.24   34.05      25.18   46.36
waveform        3          15.50      15.79  16.03     15.67   26.30      16.61   35.26
Number of wins             1          1      1         3

Table 2: Error rate of the rule ensemble method compared with that of bagging and boosting.
Error rate is given as the percent of observations in the test set that were misclassified.



6 Binary Classification Results 
6.1 Rule Ensemble with Pathbuild 

Our implementation of the algorithm Pathbuild for approximating the rule coefficients in the rule
ensemble method is described in Figure 2. The coefficients are found by solving equation (5) with
a constrained gradient descent method. In this method, each iteration only advances in directions
where the components of the gradient have magnitude greater than some fraction $\tau \in [0, 1]$ of
the absolute value of the largest gradient component. Note that the set of directions we advance
in,

$$\{k : |g_k(X; a^\ell)| \ge \tau \cdot \|g(X; a^\ell)\|_\infty\},$$

can change at every iteration. By not advancing in directions that have little change in the risk
function, the expense of updating coefficients for variables of little importance is avoided. Not
updating rules of little importance prevents the coefficient value for that rule from "stepping" off
zero, so that variable is effectively kept out of the model, allowing for a simpler model. Lower
values of $\tau$ should include more rules in the model. The most inclusive model is when $\tau = 0$, which
is equivalent to using a basic gradient descent method to get a standard regression. Larger values
of $\tau$ decrease the total number of rules used in the model. The most constrained model occurs
when $\tau = 1$.

Effect of Number of Rules and Tree Size 

In Figure 6 we see how the size of the trees and the number of rules used for the model affect the
accuracy of the model. The decision trees are used to generate rules. Larger decision trees yield
more complex rules than small trees because large trees have nodes that are deeper. Nodes deep in
a tree capture subtle interactions within the training data since they depend on more splits and are
more complex than nodes that are closer to the root node. Figure 6 shows that ensembles built with
larger trees have higher error rates than ensembles that use smaller trees. The increase in error rate
when larger trees are built shows that when the model uses more complex rules, the model overfits
the training data. However, the size of the trees does not have a strong effect on how large an
error rate the rule ensemble has. Further, the accuracy of the rule ensemble is highly variable and
the variance increases when larger trees are built. Ensembles built with trees that average 40 leaves
had 4-7% error, which is a large range when one considers that the mean classification error is only
about 5.5%. This error is larger and has more variance than the error when trees with an
average of 5 leaves are built, which is 3-4.2%. It is not clear why there is so much variance in the
error rate in general. One should recall that the number of terminal nodes in the decision
trees is exponentially distributed and only the mean of the distribution is changed, so there is a variety
of tree sizes and of rule complexity in each ensemble. Because there is
a variety of tree sizes, there is some stability in the error rate as the mean size of the trees is changed.

Figure 4: Comparison of the error rate from a model that was generated with the rule ensemble
method with the error rates from models that were generated with boosting and bagging ensemble
methods. Results are summarized in Table 2.



The bottom of Figure [6] also shows that using more rules can decrease the mean error rate of the 
rule ensemble method as well as the variance in the error rate. Increasing the number of rules built 
from 100 to 600 allowed the ensemble to capture more behavior and, as a result, nearly halved the 
error rate of the method. However, the error rate only decreases down to a certain point, after 
which adding more rules does not improve the model. For our data set, the error decreases to under 
5.0% when 600 rules are built, but does not continue to decrease substantially when more than 600 
rules are used. We also see that the error rates of ensembles that are built from more rules 
have less variance than the error rates of ensembles that are built from fewer rules. This result 
is reasonable, as having more rules gives the ensemble a better chance of finding good rules that 
successfully separate the data into classes. 

Figure 5: Comparison of the error rate on 10 different datasets from models that were built with 
the rule ensemble method when different solvers are used to find the coefficients a. 



In the initial tree building phase, a subsample of data is selected and a tree is grown on each 
random subsample. Our initial experiments took subsamples of 2,500 observations (25% of the 
total number of observations in the training set). When we decreased the subsample size to 500 
observations (10% of training set), error rates did not significantly change even for a variety of 
tree sizes that had between 5 and 80 terminal nodes. The lack of significant difference indicates 
that 500 observations give us a large enough sample to catch the same amount of behavior that is 
captured when larger subsamples of data are used to build each tree. 

Effect of Using Rules Versus Linear Terms 

In Figure [7] we see the effect of allowing the model to have linear dependencies on individual features. 
If only linear terms are used, then the model is a standard multiple linear regression. Allowing the 
model to be built with both linear terms and the rules generated by the trees yields a mixed model. 
Using rules for the regression terms provides a clear advantage over the standard regression model 
by reducing the error rate from nearly 30% error to less than 5%. The linear regression is also more 
biased in its error than the rule ensemble. This bias can be seen in the false negative rate, which is 
close to zero: nearly all of the error is caused by mislabeling observations whose true labels are 
negative. We would not expect a linear regression to capture any of the complex nonlinear behavior 
in the dataset, and the error rates show that such a conjecture is correct: rules are needed to get 
significant predictive capability. 
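The difference between linear terms and rule terms can be made concrete with a small synthetic sketch. The data, the two hand-written rules, and the plain least-squares fit (`np.linalg.lstsq`) are all assumptions chosen for illustration; they are not the supernova data, the tree-generated rules, or any of the solvers used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
# Synthetic labels with purely nonlinear structure: +1 when the two
# attributes have the same sign, -1 otherwise.
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)

def rule_features(X):
    """Each column is a 0/1 indicator of a conjunction of attribute constraints."""
    r1 = (X[:, 0] > 0) & (X[:, 1] > 0)
    r2 = (X[:, 0] <= 0) & (X[:, 1] <= 0)
    return np.column_stack([r1, r2]).astype(float)

def fit_and_error(features, y):
    """Least-squares fit with an intercept; classify by the sign of the fit."""
    A = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean(np.sign(A @ coef) != y)

err_linear = fit_and_error(X, y)                    # linear terms only
err_rules = fit_and_error(rule_features(X), y)      # rules only
err_mixed = fit_and_error(np.column_stack([X, rule_features(X)]), y)
print(err_linear, err_rules, err_mixed)
```

On this toy problem the linear model has nothing to fit, while two rules describe the labels exactly, which mirrors the qualitative gap reported in Figure [7].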

Effect of Using the τ Threshold as Penalty 

The variable τ controls how many directions are updated at each iteration of Pathbuild in the 
thresholded gradient descent method. The results of increasing τ are shown in Figure [8]. The model 
becomes less accurate and the variance of the error rate increases as τ increases. An increase in 
τ raises the threshold, so fewer terms are included in each iteration of the coefficient finding 
method and the resulting ensemble model is less accurate. It is interesting to note 
that within a certain range, decreasing τ further does not offer much increase in the predictive 
capability of the model. In this example, we see that when τ is between 0 and 0.3 there isn't a 
large increase in error rate. This indicates that using a threshold as large as τ = 0.3 or even τ = 0.4 
will not significantly compromise the accuracy of our model. This is a good result, as using a larger 
threshold decreases the computational expense of each iteration of the gradient descent method. 
The result that τ = 0.3 produces similar error rates to using τ = 0 means that we can get the same 
accuracy with less computation. 

Figure 6: The figure at the top shows that growing large trees (complex rules) increases the error 
rate. The bottom figure was made by growing trees with an average of 50 terminal nodes and shows 
that ensembles that have more rules have lower error rates. Tests were run with 500 maximum 
rules in each model. The τ tolerance was 0.5. Asterisks indicate the mean error rate from multiple 
tests. 

Figure 7: Using rules in an ensemble was six times more accurate than using only linear terms, that is, 
a classical multiple linear regression of the labels on the attribute variables. Linear regression was 
not reliable for predicting the labels, but using the rule ensemble allowed for only 5% error in 
prediction. This experiment was run using Pathbuild to solve for coefficients. 
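The role of the threshold can be sketched in a few lines. The selection rule used here, which updates coordinate k only when |g_k| >= τ · max_j |g_j|, is the usual form of thresholded gradient descent and is assumed rather than quoted from the Pathbuild description; the gradient values and the helper name `thresholded_step` are made up for illustration.

```python
import numpy as np

def thresholded_step(a, g, tau, step=0.01):
    """One update: move only along coordinates with |g_k| >= tau * max_j |g_j|."""
    active = np.abs(g) >= tau * np.max(np.abs(g))
    return a + step * np.where(active, g, 0.0), active

g = np.array([0.9, -0.5, 0.3, -0.05, 0.0])   # made-up negative gradient
a = np.zeros_like(g)
for tau in (0.0, 0.3, 0.5, 1.0):
    _, active = thresholded_step(a, g, tau)
    print(tau, int(active.sum()))
```

With τ = 0 every coordinate moves, and with τ = 1 only the coordinate with the largest gradient component moves, so larger τ means fewer coefficients are touched per iteration.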

6.2 Rule Ensemble with Glmnet 

In this experiment we use the Glmnet package [13], which returns approximations to solutions of 
elastic-net regularized generalized linear models, to solve for the coefficients a within the rule 
ensemble method. Glmnet approximates a solution to the least squared error regression subject to an 
elastic net penalty, which is 

$$\min_{a} \|Xa - y\|_2^2 + \lambda P_\alpha(a), \tag{10}$$

with a coordinate-wise gradient descent method [13]. The elastic net penalty is defined as 

$$P_\alpha(a) = \alpha \|a\|_1 + (1 - \alpha) \|a\|_2^2$$

for α ∈ [0, 1]. When α = 0 the problem is referred to as ridge regression, and when we set α = 1 
we get the same problem as in equation ([6]). The coordinate-wise gradient descent method starts 
with the null solution, similar to Pathbuild, then cycles over the coefficients and uses partial 
residuals and a soft-thresholding operator to update each coefficient one by one [12]. Glmnet 
has some modifications that also allow some parameters to be associated with and updated at the 
same time as neighboring parameters. The null solution corresponds to solving equation (10) with 
λ = ∞. As the coefficients are updated, λ is decreased exponentially until the lower bound λ_min, 
the desired and pre-specified penalty weight, is met. Glmnet calculates a set of coefficients at 
each increment of the path from λ = ∞ to λ = λ_min and uses the previous solution as a "warm start" 
to approximate the next solution. Note that λ_min should be small enough to prevent the penalty 
from being so large that it causes the vector to be overly sparse. However, λ_min should also be 
positive and large enough to ensure a sparse solution that is robust to the training data. A robust 
solution includes terms for interactions that are inherent to the application generating the data, not 
interactions that are only figments of the subset selected for training. It is not clear how to pick the 
penalty weight λ to maintain sparsity of the solution and prevent overfitting while also capturing 
enough characteristics of the dataset. 

Figure 8: Error rate increases as τ increases and restricts the number of coordinates the algorithm 
advances in at each iteration. This experiment was run with each tree having an average of 20 
terminal nodes and 600 maximum rules. 
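The path-following idea (null solution, decreasing penalty, warm starts, soft-thresholded coordinate updates) can be sketched as follows. This is a plain cyclic coordinate descent for the lasso case α = 1, written for the scaled objective (1/2N)||Xa - y||² + λ||a||₁; the scaling, the synthetic data, and the function names are assumptions, and none of the Glmnet-specific modifications are reproduced.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_path(X, y, lambdas, n_sweeps=100):
    """Cyclic coordinate descent for (1/2N)||Xa - y||^2 + lam * ||a||_1."""
    n, p = X.shape
    col_sq = (X ** 2).sum(axis=0) / n
    a = np.zeros(p)                       # null solution
    path = []
    for lam in lambdas:                   # penalties in decreasing order
        for _ in range(n_sweeps):
            for j in range(p):
                partial = y - X @ a + X[:, j] * a[j]   # partial residual
                rho = X[:, j] @ partial / n
                a[j] = soft_threshold(rho, lam) / col_sq[j]
        path.append(a.copy())             # a is reused as the warm start
    return np.array(path)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=100)
lam_max = np.max(np.abs(X.T @ y)) / len(y)    # penalty at which a = 0 is optimal
lambdas = lam_max * np.logspace(0, -3, 10)
path = lasso_path(X, y, lambdas)
nonzeros = (np.abs(path) > 1e-8).sum(axis=1)
print(nonzeros)
```

The printed counts show coefficients stepping off zero as the penalty is relaxed, which is the sparsity and accuracy trade-off that the choice of λ_min controls.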

Here we use the rules generated in the previous experiment with Glmnet and build models using 
the coefficients that are generated at each step of the path λ ∈ [λ_min, ∞]. Figure [9] shows how the 
accuracy of the method changes as the weight of the penalty used to find the coefficients changes. 
The solution with Glmnet when λ is small results in slightly less error than the solution with 
Pathbuild when τ is small. The variance in the error rates from solutions found with Pathbuild 
is less than the variance of error rates from solutions found with Glmnet. Both solutions yield 
false positive rates that are more than twice as large as the false negative rates; this is probably 
because the ratio of positive to negative observations in the test set is small. The error rate 
slowly decreases as λ decreases, but then stabilizes when λ is very small (< 0.01). It 
is interesting that the variance in error rates of the solutions is relatively constant as λ changes. 

Figure 9: Error rate decreases when Glmnet is used to solve for the coefficients and the constraint 
parameter λ decreases. 

6.3 Rule Ensemble with Spgl1 

In this experiment, we used the Spgl1 (sparse projected-gradient ℓ1) Matlab™ package [27] to 
solve for the coefficients a in 

$$\min_{a} \|Xa - y\|_2 \quad \text{subject to} \quad \|a\|_1 \le \sigma. \tag{11}$$

At each iteration of the algorithm, a convex optimization problem is constructed, whose solution 
yields derivative information that can be used by a Newton-based root-finding algorithm [28]. Each 
iteration of the Spgl1 method has an outer/inner iteration structure, where each outer iteration 
first computes an approximation to σ. The inner iteration then uses a spectral gradient-projection 
method to approximately minimize a least-squares problem with an explicit one-norm constraint 
specified by σ. Some advantages of the Spgl1 method are that only matrix-vector operations are 
required and numerical experience has shown that it scales well to large problems. 

The results using Spgl1 are shown in Figure [10]. The accuracy of the Spgl1 solution increases 
when σ increases. The error rates are similar to those found by Pathbuild and Glmnet, but 
slightly higher than Glmnet even when σ is large. 



6.4 Rule Ensemble with Fpc 

In this experiment, we used a fixed point continuation method (Fpc) [17] that approximates the 
solution a in 

$$\min_{a} \|a\|_1 + \frac{\mu}{2} \|Xa - y\|_2^2. \tag{12}$$

This problem formulation seeks to minimize the weighted sum of the norm of the coefficients and 
the error of the solution, the left and right terms respectively. The sparsity of a is controlled by the 
size of the weighting parameter μ. Increasing μ places more importance on minimizing the error, 
and reduces the ratio of the penalty to the error. The reduction of penalty importance allows more 
coefficients to become non-zero (the ℓ1 norm of the coefficients to increase) and thus find a closer 
fit to the problem. 

Figure 10: Error rate decreases when Spgl1 is used to solve for the coefficients and the constraint 
parameter σ increases. 

Equation (12) is simply a reformulation of problem ([6]) with the lasso penalty, 
and is referred to as a basis pursuit problem in signal processing. The relation of the two problems 
can clearly be seen if, for any λ value, μ is chosen to be 

$$\mu = \frac{2}{\lambda}$$

and equation (12) is multiplied by λ. Fpc was developed for compressing signals by extracting the 
central components of the signal. 



Fpc exploits the properties of the ℓ2 norm and declares three equivalent conditions for reaching an 
optimal solution. Fpc uses the reformulations of the optimality conditions to declare a shrinkage 
operator $s_\nu$, where ν is a shrinkage parameter that affects both the speed of convergence 
and how many non-zero entries $a^*$ has. The operator $s_\nu$ acts on a supplied initial value $a^0$ (which 
we chose to be the null solution) and finds our solution $a^*$ through a fixed point iteration 

$$a^* = s_\nu(a^*).$$

The given condition for the threshold of $s_\nu$ is 

$$\text{if } \nu - |y| > 0 \text{ then } s_\nu(y) = 0.$$
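A minimal sketch of a shrinkage-based fixed point iteration for problem (12) follows. It assumes the common form in which the shrinkage operator is applied after a gradient step on the squared-error term, with step length t and shrinkage parameter ν = t/μ; the step-length choice, the synthetic data, and the function names are assumptions, and the continuation strategy of Fpc is not reproduced.

```python
import numpy as np

def shrink(z, nu):
    """Soft-thresholding: entries with nu - |z| >= 0 are set exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - nu, 0.0)

def fixed_point_solve(X, y, mu, n_iter=2000):
    """Iterate a <- shrink(a - t * X^T (X a - y), t / mu) from the null solution."""
    t = 1.0 / np.linalg.norm(X, 2) ** 2      # step below 2 / lambda_max(X^T X)
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        a = shrink(a - t * X.T @ (X @ a - y), t / mu)
    return a

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 10))
y = X[:, 1] - 2.0 * X[:, 4] + 0.05 * rng.normal(size=60)
counts = []
for mu in (0.001, 0.1, 10.0):
    a = fixed_point_solve(X, y, mu)
    counts.append(int(np.count_nonzero(a)))
print(counts)
```

For very small μ the threshold is never overcome and the null solution is returned; as μ grows, more coefficients step off zero.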

Fpc forms a path of solutions that starts with fj, initialized to ^ = n^^yn — (where ?? is a ratio of 
possible optimal square error at the next step to the square error at the current step). The param- 
eter n is altered at each step, which forces the shrinkage parameter to expand and contract but 
the upper bound for μ is supplied by the user. All results presented here use Fpc with projected 
gradient steps and, optionally, a variant of Barzilai-Borwein steps [2]. 



The results of solutions generated by Fpc are shown in Figure [11]. They are roughly as accurate as 
the solutions generated with the previous solvers. Fpc also has an explicit display of the 
thresholding, as seen in Figure [13]. The norm of the coefficients increases dramatically and then 
asymptotically approaches a certain value. The asymptotic behavior is caused by the threshold 
constricting the coefficients and essentially preventing another coefficient from stepping off of zero. 
The thresholding is also seen in the error rate, which decreases as the weight on the mean squared 
error is increased, but stabilizes once the training set is reasonably fit. The value of μ where the 
error stabilizes is the value needed to build the model, but unfortunately it is not clear how to choose 
this value of μ a priori. The need to select the penalty parameter is one of the difficulties that Fpc, 
Spgl1, and Glmnet have. Pathbuild shares a similar problem with the need to select the 
gradient descent constriction parameter τ. 

Figure 11: Error rate decreases when Fpc is used and the weight on the risk or mean squared error 
is increased (μ increased). 

6.5 Identifying Important Attributes Via Rule Importance 

Figure [11] shows that the rule ensemble method is quite successful at correctly classifying 
observations when all of the attributes are used to generate rules and build the model. Attributes have 
variable importance in the model and we suspect that not all of the 39 attributes in the full dataset 
are needed to model and correctly predict class labels. We want to use the rule ensemble method 
to select only the attributes that are important and save the expense of considering the other less 
important variables. 

The importance of a rule is indicated by the magnitude of the coefficient for that rule. The larger a 
coefficient is in magnitude, the more important the corresponding rule is, since that rule will have a 
larger contribution to the model. To sift out the most important attributes, we look at which rules 
Fpc considered important at different values of μ. Rules are ordered by the magnitude of their 
corresponding coefficient and the rules corresponding to the 20 largest (in magnitude) coefficients 
are selected. An example of ordering the rules is in Table [3], where the 5 most important rules from 
one test are ordered. This process is repeated for 5 different repetitions of training and testing, 
which yields 5 sets of the 20 most important rules. The sets of rules are decomposed into sets of 
attributes that are used to make up the rules in each set. Then we let the 5 repetitions vote on which 
attributes are influential and keep only attributes that are in the set of important attributes for at 
least 3 out of the 5 repetitions. Figure [14] shows how many votes the highest ranking rules get and 
indicates that certain rules are important in all solutions while others are considered important in 
only some solutions. This set of attributes forms a smaller subset of the 39 attributes available in 
the initial dataset. The subset only contains attributes that are used in at least one of the 
20 most important rules in at least 3 of the 5 repetitions. 

Figure 12: There is little fluctuation in the overall error rate when Fpc is used on rules that were 
built with different size trees. Only the mean of cross validation tests is plotted here for simplicity. 
Little fluctuation implies that simpler rules, which come from smaller trees, can be used to build a 
model without sacrificing predictive capability. 

Figure 13: Sparsity of solution is indicated by the number of coefficients not equal to zero. As μ 
is increased, the solution becomes less penalized and more coefficients step off zero and allow more 
terms to be included in the model. The sparsity of the solution stops decreasing when μ is large 
and the penalty is relatively small compared to the emphasis on minimizing the risk or second term 
in equation (12). Here 78% of the coefficients are trivial when μ = 0.19. 

The importance of a rule is indicated by the magnitude of the coefficient for that rule. The larger 
a coefficient is in magnitude, the more important the corresponding rule is, as that rule will have a 
larger contribution to the model. To sift out the most important attributes, we look at which rules 
Fpc considered important at different values of μ. Rules are ordered by the magnitude of their 
corresponding coefficient and if a rule is one of the top 20 most important in a solution generated 
with a certain μ (we considered 13 values of μ), then that rule receives a vote. An example of 
ordering the rules is in Table [3], where the 5 most important rules from one test with a given μ are 
ordered. Figure [14] shows for how many values of μ each rule was considered to be in the top 20 
most important; this indicates that certain rules are important in solutions with all values of μ tried 
while others are considered important only when certain μ are used. This process is repeated for 
5 different cross-validation sets, which yields 5 sets of rules that were in the top 20 most important 
rules for at least one value of μ. The sets of rules are decomposed into sets of the attributes that 
were used to make up the rules in each set. Then we let the 5 repetitions vote on which attributes 
are needed to make the most influential rules and keep only the attributes that are in the set of 
important attributes for at least 3 out of the 5 repetitions. This set of attributes forms a smaller 
subset of the total attributes available in the initial dataset; it is the subset of attributes that are used 
in at least one of the most important rules in at least 3 of the 5 repetitions. 
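The voting scheme can be sketched compactly. The version here covers the simpler case of one solution per repetition and leaves out the additional voting over values of μ; the function names, the data layout, and the toy coefficients are hypothetical and chosen only to make the steps explicit.

```python
from collections import Counter

def top_rules(coefs, n_top=20):
    """Indices of the n_top rules with the largest |coefficient| (zeros excluded)."""
    order = sorted(range(len(coefs)), key=lambda k: abs(coefs[k]), reverse=True)
    return [k for k in order[:n_top] if coefs[k] != 0]

def select_attributes(repetitions, min_votes=3, n_top=20):
    """repetitions: list of (coefs, rule_attrs) pairs, one per repetition.
    rule_attrs[k] is the set of attribute indices that rule k is defined on."""
    votes = Counter()
    for coefs, rule_attrs in repetitions:
        used = set()
        for k in top_rules(coefs, n_top):
            used |= rule_attrs[k]
        votes.update(used)                # one vote per attribute per repetition
    return sorted(attr for attr, v in votes.items() if v >= min_votes)

# Toy data (made up): 3 rules per repetition, and only the top 2 rules are kept.
reps = [
    ([0.9, -0.5, 0.1], [{2, 18}, {29}, {7}]),
    ([0.8, 0.05, 0.4], [{2}, {7}, {29}]),
    ([0.0, 0.7, 0.6], [{7}, {2, 29}, {23}]),
    ([0.3, 0.2, 0.9], [{23}, {1}, {2}]),
    ([0.6, 0.5, 0.0], [{29}, {12}, {2}]),
]
selected = select_attributes(reps, min_votes=3, n_top=2)
print(selected)
```

In the toy data, attributes 2 and 29 each appear among the top rules in four of the five repetitions and are kept, while attributes that appear only once or twice are dropped.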

For the supernova dataset, the smaller subset of attributes included only 21 of the 39 original 
attributes. Tests were repeated using the same cross-validation sets and method parameters as were 
used in Figure [11], but using only the smaller subset of 21 attributes to train on rather than all 39 
attributes. Figure [15] compares the error rate of the method when 21 attributes were used with 
the error rate of the method when all 39 attributes were used. The results show that the accuracy 
of the method improves when we reduce the number of attributes used in the model. The method 
successfully ranks rules and identifies more important attributes. The method loses accuracy when 
the less important features are included; in essence, the extra attributes act as noise. After the 
method identifies these attributes as less important and we remove them, the method is able to 
return an even more accurate model and the insight of which attributes are not adding predictive 
capability to the model. Garnering better accuracy with fewer attributes may allow the extra at- 
tributes to be excluded from the data collection, which will save time in collecting data, save space 
in storing data, and allow an overall better analysis. 



7 Conclusions 

We compared several variations of a rule ensemble method with some well-known tree ensemble 
methods, namely boosting and bagging, on a variety of multi-class problems. We extended the rule 
ensemble to work on multi-class problems by using the OVA technique and found that with this 
extension the rule ensemble method performed comparably to the tree methods on a set of 10 classical 
datasets. This result highlights the power of the rule ensemble method, as we had expected the 
tree ensemble methods to do better on multi-class problems. Tree ensembles can use multi-class 
decision trees, which provide what one would think is a more natural extension to multi-class 
problems than using the OVA method. However, the rule ensemble method returned comparable rates 
of accuracy on most datasets and even performed better on some of the datasets. The discrepancy 
between the tree ensembles with voting and the rule ensemble was larger on problems that had a 
relatively large number of labels, such as the pendigits dataset, which had the most labels out of 
all the datasets, than on datasets with fewer labels. To improve the accuracy of the rule ensemble 
on problems with many classes, we would like to try using multi-class decision trees to build the 
rules and then relabel the nodes for each binary problem. This technique might yield better rules 
as it would allow for differentiation between the classes in the rule building phase. Better rules 
would then allow for a clearer separation of binary labels in the regression phase. This technique 
would also make the training phase more efficient as it would only require one set of rules to be 
constructed rather than as many sets of rules as there are classes. 

Figure 14: This histogram shows how many times a rule was one of the top 20 most important rules 
in a solution. A solution was generated at each of 13 different values of μ, as shown in Figure [11]. 
Rules that received 13 votes were one of the 20 most influential rules for every value of μ tried. Only 
rules that were in the top 20 most influential for at least one solution are shown. The attributes 
that were used in the rules shown here were used to find a smaller subset of attributes to train on 
for the results in Figure [15]. 

Rule r_k                         |a_k|
x2 > -0.315 & xis > 0.047        0.1045
x29 < -0.251                     0.0725
x23 > -0.606                     0.0317
x1 < -0.324                      0.0274
x12 > 0.260                      0.0193

Table 3: Example of ordering rules by importance. These are the five rules with greatest importance 
in the first model as selected by Fpc with μ = 0.25. 

Figure 15: Comparison of overall error rate when fewer attributes are used. The preliminary tests 
used all 39 attributes in the dataset. The subsequent tests used only the subset of 21 attributes that 
were used to construct the most important rules in the preliminary tests. Using the restricted set of 
attributes gives a lower error rate, indicating that the rule ensemble method successfully identified 
which attributes are important in the dataset. 
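The OVA (one-versus-all) extension can be sketched generically: one binary model is trained per class with that class labeled +1 and all others -1, and a new observation is assigned to the class whose model gives the largest score. The least-squares scorer used here is only a stand-in for a binary rule ensemble, and the argmax decision rule, function names, and synthetic clusters are assumptions for illustration.

```python
import numpy as np

def fit_binary_scorer(X, y_pm1):
    """Stand-in binary model: least-squares fit of +/-1 labels (hypothetical)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y_pm1, rcond=None)
    return coef

def fit_ova(X, labels):
    """Train one binary scorer per class: that class versus all the others."""
    classes = np.unique(labels)
    models = [fit_binary_scorer(X, np.where(labels == c, 1.0, -1.0)) for c in classes]
    return classes, np.array(models)

def predict_ova(classes, models, X):
    """Assign each observation to the class with the largest binary score."""
    A = np.column_stack([np.ones(len(X)), X])
    scores = A @ models.T                 # one column of scores per class
    return classes[np.argmax(scores, axis=1)]

rng = np.random.default_rng(0)
centers = np.array([[-3.0, -3.0], [3.0, -3.0], [0.0, 3.0]])
labels = np.repeat(np.array([0, 1, 2]), 50)
X = centers[labels] + 0.5 * rng.normal(size=(150, 2))
classes, models = fit_ova(X, labels)
accuracy = np.mean(predict_ova(classes, models, X) == labels)
print(accuracy)
```

The sketch also shows why OVA training cost grows with the number of classes: a separate model, and in the rule ensemble setting a separate set of rules, is built for each label.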

We also looked at using 4 different methods to find coefficients to assemble the rules. All 4 methods 
present the challenge of needing to select a constraint parameter that controls the sparsity/accuracy 
trade-off of the solution that they return. If each parameter is chosen correctly, then the methods 
are capable of producing coefficients that allow for similar accuracy in the model. The different 
approaches that the methods take for finding the coefficients do result in slightly different 
rankings of the rules. The difference in coefficients that each method considers important is shown in 
Figure [16]. Ideally all solvers would select the same terms to be the most significant and would 
order the terms by importance the same way. Figure [16] shows that some rules that one method 
considers important are not considered to be important by another method. Fpc and Spgl1 order 
coefficients similarly, which is indicated by Spgl1 giving a significant magnitude to coefficients 
that Fpc also gives a significant magnitude to. Glmnet's and Pathbuild's orderings share less 
similarity with Fpc and Spgl1, as indicated by coefficients such as 9 and 18 that Glmnet and 
Pathbuild give a significant magnitude to, but both Fpc and Spgl1 give trivial values to. The 
difference in methods is also reflected in the sparsity of the solutions that they return. To achieve 
similar accuracy (taken here at 96% accuracy) Pathbuild returns a solution with 40-50% of the 
coefficients non-zero while the other methods return much sparser solutions that have only 12-19% 
of the coefficients non-zero. In general, Spgl1 returned the sparsest solutions and Pathbuild 
returned the least sparse solutions for models with similar error rates. 

Figure 16: Length of bars indicates the magnitude of coefficients as calculated by different solvers. 
Only the coefficients with the 10 largest magnitudes from each solver are displayed. Coefficients 
plotted come from solutions that yielded similar error rates: τ = 0.4, μ = 0.11, σ = 8.5, λ = 0.014. 

As a final step, we showed the utility of the rule ensemble method for identifying important at- 
tributes in a dataset containing images of potential supernovas. The rule ensemble method has the 
benefit over tree methods of providing insight into a dataset by returning weighted rules. Rules 
with large weights have a larger effect on the model and thus can be thought of as more important 
than other rules. We used the importance of such rules to alert us to the more significant features 
in the dataset by looking at which features the important rules are defined on. This technique 
allowed us to select 21 attributes out of the 39 available and reduce the error rate of the model by 
building models only on the reduced set of attributes. Traditional algorithms that use ensembles of 
decision trees, such as boosting and bagging, aren't able to provide this insight into the importance 
of certain variables of a dataset because they do not rank or weight rules. 

The rule ensemble method has the advantage over some other methods of being able to identify 
relationships and hierarchies between variables to a certain extent when building the decision trees. 
The rules in the decision trees get more complex the deeper the tree is grown and also are able to 
have limited support in the parameter space, so they only affect certain observations that fall in that 
space. By including more variables, complex rules can be seen as resembling discrete correlations, 
and the post-processing of the rules allows for overly simplified correlations (that precede more 
complex rules in depth) to be removed from the model. The post-processing also allows for overly 
complex rules to be pruned from the model. Thus some variable interactions can be captured by 
the rule ensemble method without any a priori assumption that they exist, as is needed in standard 
regression models, and excessive computation is not spent considering correlations that do not exist. 

We do not compare the computational efficiency of the rule ensemble method with tree ensemble 
methods here, since it is currently written in Matlab™ while the tree ensemble methods used 
are written in C. However, we do not expect that the rule ensemble method will reduce the amount 
of time necessary for the training portion of the algorithm to run because it must perform the 
coefficient solving method in addition to the tree growing. If the rule ensemble method is able to 
prune a substantial number of repetitive or unnecessary rules, then it is likely to run substantially 
more quickly than the tree methods. Comparing the time efficiency of the rule ensemble with other 
tree methods and other machine learning techniques will be part of future work. We do not present 
the computational efficiency of the coefficient solving methods used in the rule ensemble method for 
the same reason. Each solver is written in a different programming language, and each will have to 
be implemented in the same language and at the same level of optimization before a meaningful 
study can be performed. 

Here we discuss the gradient method Pathbuild, which is described in section [3], in greater detail. 
Simplifications of the gradient method are presented and considered as the "fast method". 



A Derivation of the Negative Gradient of Risk 

The negative gradient g of the loss on the observations is found by taking partial derivatives 
of the sum of the loss on each observation with respect to each coefficient. The components of the 
negative gradient are given by 

$$g_k = -\frac{\partial}{\partial a_k} \frac{1}{N} \sum_{i=1}^{N} L(y_i, F(x_i)), \tag{13}$$

where k = 1, ..., K. Note that $g_0 = 0$, as $a_0$ is the constant intercept that minimizes the risk when 
$F(x_i) = a_0$ and all the other coefficients have not moved off their initial zero value; 
$\{g_k, k = 1, \ldots, K\}$ are the nontrivial components of the gradient. By the chain rule, 

$$\frac{\partial}{\partial a_k} L(y_i, F(x_i)) = \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \, \frac{\partial F(x_i)}{\partial a_k}. \tag{14}$$

Note that the second term is easily computed from the linear form of $F(x_i)$ and is given by 

$$\frac{\partial F(x_i)}{\partial a_k} = x_{ik}. \tag{15}$$

A.1 Negative gradient when squared error ramp loss is used 

The previous discussion has been generalized for the use of any loss function L(·). Now consider 
the case when the loss function is given by 

$$L(y_i, F(x_i)) = [y_i - H(F(x_i))]^2, \qquad H(F(x_i)) = \max[-1, \min(1, F(x_i))], \tag{16}$$

which is the squared error ramp loss for the i-th observation. We want to find the derivative with 
respect to a for this loss function. Begin by taking a partial derivative with respect to F 

$$\frac{\partial}{\partial F} L(y_i, F(x_i)) = -2 (y_i - F(x_i)) \, I(|F(x_i)| < 1). \tag{17}$$

Substitute (15) and (17) into (14) to get the derivative for the squared error ramp loss 

$$\frac{\partial}{\partial a_k} L(y_i, F(x_i)) = -2 (y_i - F(x_i)) \, x_{ik} \, I(|F(x_i)| < 1). \tag{18}$$
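The derivative in (17) can be checked numerically with a short sketch. The labels and predictions are made up, and the central finite difference is used only as a sanity check of the formula away from the kinks at |F| = 1.

```python
import numpy as np

def ramp_loss(y, F):
    """Squared error ramp loss (16): [y - max(-1, min(1, F))]^2."""
    return (y - np.clip(F, -1.0, 1.0)) ** 2

def ramp_loss_dF(y, F):
    """Derivative (17): -2 (y - F) I(|F| < 1)."""
    return -2.0 * (y - F) * (np.abs(F) < 1.0)

y = np.array([1.0, -1.0, 1.0, -1.0])
F = np.array([0.3, 0.6, 1.7, -2.0])       # two inside the ramp, two outside
h = 1e-6
numeric = (ramp_loss(y, F + h) - ramp_loss(y, F - h)) / (2 * h)
analytic = ramp_loss_dF(y, F)
print(analytic, numeric)
```

Observations with |F| >= 1 contribute nothing to the gradient, which is what allows the indicator bookkeeping used in the fast algorithm.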

Using the form of $F(x_i)$ in the partial derivative (18) and then substituting into (13), we get the 
gradient for the risk using the squared error ramp loss function 

$$g_k = \frac{2}{N} \sum_{i=1}^{N} \Big( y_i - a_0 - \sum_{j=1}^{K} a_j x_{ij} \Big) x_{ik} \, I(|F(x_i)| < 1). \tag{19}$$

Rearranging, switching the order of summation, and evaluating at the ℓ-th step in the 
approximation of a, we can write the gradient at the ℓ-th step as 

$$g_k^\ell = \frac{2}{N} \sum_{i=1}^{N} y_i x_{ik} \, I(|F^\ell(x_i)| < 1) - a_0 \frac{2}{N} \sum_{i=1}^{N} x_{ik} \, I(|F^\ell(x_i)| < 1) - \sum_{j=1}^{K} a_j^\ell \frac{2}{N} \sum_{i=1}^{N} x_{ij} x_{ik} \, I(|F^\ell(x_i)| < 1). \tag{20}$$



A.2 Negative gradient with auxiliary functions v, u 

We need to keep track of the dependencies and update properly at each iteration. The goal of the 
method is to update the coefficients a. We take a step with respect to a and then update everything, 
so let a act as the independent variable. Recall that i is the index over the observations, so $x_i$ is the 
vector of attribute values for the i-th observation and $F_i$ is the predicted value for that observation. 
This leaves us with 

$$F^\ell(x_i) = a_0 + \sum_{k=1}^{K} a_k^\ell x_{ik} = F_i(a^\ell).$$

Defining the indicators 

$$v_i^\ell = v_i(a^\ell) = I(|F_i(a^\ell)| < 1),$$

we can define a new function by 

$$u^\ell(p, q) = u(v^\ell; p, q) = \frac{2}{N} \sum_{i=1}^{N} p_i q_i v_i^\ell, \tag{21}$$

where p and q have one entry per observation (a scalar argument such as 1 denotes the constant 
vector) and $v_i^\ell = I(|F^\ell(x_i)| < 1)$. Using the two functions v, u the negative 
gradient at the ℓ-th step (20) can be written in a simpler form 

$$g_k^\ell = u(v^\ell; y, x_k) - a_0 \, u(v^\ell; 1, x_k) - \sum_{j=1}^{K} a_j^\ell \, u(v^\ell; x_j, x_k). \tag{22}$$
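As a consistency check, the form (22) built from the auxiliary function u can be compared numerically with the direct expression (19). The sketch uses random 0/1 rule outputs and arbitrary coefficients; the data and function names are illustrative assumptions.

```python
import numpy as np

def u(v, p, q):
    """Equation (21): (2/N) * sum_i p_i q_i v_i."""
    return 2.0 / len(v) * np.sum(p * q * v)

def neg_gradient_u(X, y, a0, a):
    """Equation (22), assembled term by term from u."""
    v = (np.abs(a0 + X @ a) < 1.0).astype(float)
    ones = np.ones(len(y))
    K = X.shape[1]
    return np.array([
        u(v, y, X[:, k]) - a0 * u(v, ones, X[:, k])
        - sum(a[j] * u(v, X[:, j], X[:, k]) for j in range(K))
        for k in range(K)
    ])

def neg_gradient_direct(X, y, a0, a):
    """Equation (19), evaluated directly."""
    F = a0 + X @ a
    return 2.0 / len(y) * X.T @ ((y - F) * (np.abs(F) < 1.0))

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(50, 4)).astype(float)   # 0/1 rule outputs
y = rng.choice([-1.0, 1.0], size=50)
a0, a = 0.1, rng.normal(scale=0.8, size=4)
g_u = neg_gradient_u(X, y, a0, a)
g_direct = neg_gradient_direct(X, y, a0, a)
print(np.allclose(g_u, g_direct))
```

The value of the u form is that its pieces can be cached and updated incrementally, rather than recomputed from all observations after every step.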



B Fast Algorithm 

To "step" we move proportional to the largest component of the negative gradient (13). Let $g_{j^*}^\ell$ be 
the largest absolute component of the gradient, 

$$j^* = \arg\max_{1 \le j \le K} |g_j^\ell|,$$

at the ℓ-th step. Then call the length of the next step $\delta^\ell = \Delta\nu \, g_{j^*}^\ell$, where $\Delta\nu$ is the step 
size, and update the coefficients with $a_{j^*}^{\ell+1} = a_{j^*}^\ell + \delta^\ell$. The coefficients at the (ℓ+1)-th step are 

$$a_j^{\ell+1} = \begin{cases} a_j^\ell & \text{if } j \ne j^*, \\ a_j^\ell + \delta^\ell & \text{if } j = j^*. \end{cases} \tag{23}$$

After a step the gradient must be recomputed before another step can be taken. Rather than fully 
recomputing, an update can be applied only to the components of the gradient that are affected by 
the step. There are two cases of how the update to the gradient can be made. One update occurs 
when the step in the coefficients has caused indicator functions to change; this update requires 
more work and is expensive. The other update is cheap and is given as follows. 



B.1 Case when indicators do not change 

The step size should be small; in practice it is taken to be 0.01. The idea is that with a small 
step size $|F(x_i)|$ will not exceed 1 "often." On the steps where this is true the indicators do not 
change, so $v^\ell$, $u(v^\ell; y, x_k)$, and $u(v^\ell; x_j, x_k)$ do not change and the negative gradient at the 
(ℓ+1)-th step is found by substituting (23) into (22) 

$$\begin{aligned} g_k^{\ell+1} &= u(v^\ell; y, x_k) - a_0 \, u(v^\ell; 1, x_k) - \sum_{j=1}^{K} a_j^\ell \, u(v^\ell; x_j, x_k) - \delta^\ell u(v^\ell; x_{j^*}, x_k) \\ &= g_k^\ell - \delta^\ell u(v^\ell; x_{j^*}, x_k). \end{aligned} \tag{24}$$
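The cheap update (24) can be verified against a full recomputation with a short sketch. It starts from the null solution with the intercept set to the mean label and takes one step of size 0.01 times the largest gradient component; the data are synthetic and the variable names are illustrative assumptions.

```python
import numpy as np

def neg_gradient(X, y, a0, a):
    """Negative gradient (19) for the squared error ramp loss."""
    F = a0 + X @ a
    return 2.0 / len(y) * X.T @ ((y - F) * (np.abs(F) < 1.0))

rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(200, 6)).astype(float)   # 0/1 rule outputs
y = rng.choice([-1.0, 1.0], size=200)
a0, a = y.mean(), np.zeros(6)             # null solution

g = neg_gradient(X, y, a0, a)
v = (np.abs(a0 + X @ a) < 1.0).astype(float)
j_star = int(np.argmax(np.abs(g)))
delta = 0.01 * g[j_star]                  # step length with step size 0.01

a_new = a.copy()
a_new[j_star] += delta
v_new = (np.abs(a0 + X @ a_new) < 1.0).astype(float)
indicators_unchanged = bool(np.array_equal(v, v_new))

# Cheap update: g_k^{l+1} = g_k^l - delta * u(v; x_j*, x_k) for every k.
u_col = 2.0 / len(y) * X.T @ (X[:, j_star] * v)
g_fast = g - delta * u_col
update_matches = bool(np.allclose(g_fast, neg_gradient(X, y, a0, a_new)))
print(indicators_unchanged, update_matches)
```

Only one column of u values is needed per step, so the update costs a single vector operation once those values are stored.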

B.2 Case when indicators change - adjustments 

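If the assumption fails and the indicators change on a step, then $v^\ell \ne v^{\ell+1}$, $u(v^\ell) \ne u(v^{\ell+1})$, 
and (24) does not hold. To find $g^{\ell+1}$, consider the cases of how v can change and define the 
variable 

$$z_n = \begin{cases} -1, & \text{if } v_n^\ell = 1 \text{ and } v_n^{\ell+1} = 0, \\ 0, & \text{if } v_n^\ell = v_n^{\ell+1}, \\ +1, & \text{if } v_n^\ell = 0 \text{ and } v_n^{\ell+1} = 1. \end{cases} \tag{25}$$

$z_n$ can be thought of as adding in observations where the indicators have turned on and subtracting 
observations where indicators have turned off. Using $z_n$, u can be adjusted 

$$\begin{aligned} u(v^{\ell+1}; y, x_k) &= u(v^\ell; y, x_k) + \frac{2}{N} \sum_{z_n \ne 0} z_n y_n x_{nk}, \\ u(v^{\ell+1}; x_j, x_k) &= u(v^\ell; x_j, x_k) + \frac{2}{N} \sum_{z_n \ne 0} z_n x_{nj} x_{nk}, \\ u(v^{\ell+1}; 1, x_k) &= u(v^\ell; 1, x_k) + \frac{2}{N} \sum_{z_n \ne 0} z_n x_{nk}, \end{aligned} \tag{26}$$

and used with the gradient expression (22) and the coefficient update (23) to find the update of the 
negative gradient 

$$g_k^{\ell+1} = u(v^{\ell+1}; y, x_k) - a_0 \, u(v^{\ell+1}; 1, x_k) - \sum_{j=1}^{K} a_j^{\ell+1} \, u(v^{\ell+1}; x_j, x_k). \tag{27}$$

With a little more rearrangement the update to the gradient can be written as 

$$\begin{aligned} g_k^{\ell+1} = g_k^\ell \; &+ \frac{2}{N} \sum_{z_n \ne 0} z_n y_n x_{nk} && \text{update } u(v^{\ell+1}; y, x_k) \text{ from indicator change} \\ &- a_0 \frac{2}{N} \sum_{z_n \ne 0} z_n x_{nk} && \text{update } u(v^{\ell+1}; 1, x_k) \text{ from indicator change} \\ &- \delta^\ell u(v^\ell; x_{j^*}, x_k) && \text{step in } j^* \text{ direction} \\ &- \delta^\ell \frac{2}{N} \sum_{z_n \ne 0} z_n x_{nj^*} x_{nk} && \text{terms not included in update due to old } v \\ &- \frac{2}{N} \sum_{j=1}^{K} a_j^\ell \sum_{z_n \ne 0} z_n x_{nj} x_{nk} && \text{adjust for observations with changed } I(|F_i| < 1). \end{aligned} \tag{28}$$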
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